Introduction to statistical models

R and packages

  • “Base” R
    • Dates back to the 1990s
    • Now at version 4.2.2
    • Evolves a bit, but nothing dramatic
  • Packages
    • Add-ons for extra functionality
    • Anyone can write: much more dynamic
  • RStudio
    • A nice front-end for R
  • (If possible, best to keep RStudio, R and packages up to date.)

“Tidy” data in R

  • “Tidy” has specific meaning, i.e. that data are in table form with
    • row = an observation (“object”)
    • column = a variable (“measurement type”)
    • each cell containing one value
  IncidentID  DataZone   FF_Incident_Type        Risk_Level  DateCreated
  5224360     S01008082  False Alarm (UFAS)      2           2014-09-11
  7478338     S01009650  Other Primary Fire      3           2017-12-30
  5716405     S01012100  False Alarm (Dwelling)  2           2015-06-24
  7832270     S01009860  Dwelling Fire           5           2018-07-02
  5921500     S01011244  Outdoor Fire            2           2015-10-15
  • Strong connections to SQL
  • Terminology: “Table”, “Data Frame”, “Tibble” are mostly interchangeable.

Tidyverse packages - dplyr

  • dplyr = “data plier”
  • Functions to manipulate data table(s):
    • select to pick columns
    • filter to pick rows
    • group_by and tally to summarise
    • mutate to add new columns
    • left_join, full_join, etc, to join different tables
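A minimal sketch of these verbs in action, using a made-up toy data frame (the column names here are chosen for illustration and are not from our real tables):

```r
library(dplyr)

# A toy table standing in for real data
toy <- tibble::tibble(
  DataZone   = c("A", "A", "B"),
  Risk_Level = c(2, 5, 2)
)

select(toy, DataZone)                     # pick columns
filter(toy, Risk_Level == 2)              # pick rows
mutate(toy, high_risk = Risk_Level >= 4)  # add a new column
tally(group_by(toy, DataZone))            # count rows per group
```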

Packages - installing and loading

  • If a package isn’t already installed we can install with:
install.packages("dplyr")
  • After it’s installed we can load with the library command:
library(dplyr)

This puts all the package’s functions, data sets, etc, into the “environment”.

  • We can use library(tidyverse) to load in all the tidyverse packages in one go (including dplyr)
  • In RStudio, Tools -> Check for Package Updates, then Select All and Install Updates, keeps packages up to date.

Getting started for today

  • First we’ll do Session -> Set Working Directory to select a suitable folder (where we’ve put incidents.rds).
  • Then we’ll load the incidents data file:
incidents <- readr::read_rds('incidents.rds')
  • (The :: tells R to look for read_rds in the readr package.)
  • Then glimpse lets us see what we’ve loaded:
glimpse(incidents)

dplyr: filtering and selecting

filter(incidents, DataZone == "S01012100")
select(incidents, DateCreated)
select(filter(incidents, DataZone == "S01012100"), DateCreated)

(Tip: in RStudio pressing the TAB key often helpfully autocompletes function and variable names)

These commands are “standard” code. They work fine but can soon become hard to read once we nest multiple function calls. This is where “piping” helps.

The pipe, |>

  • That previous command read out loud is: “Take incidents data then filter (down to a particular DataZone) then select the DateCreated variable.”
incidents |> filter(DataZone == "S01012100") |> select(DateCreated)
  • The pipe, |>, is read as “then”
  • Can string together “pipelines” of commands that remain readable
  • Pipelines are easy to “break up” when developing/checking code, e.g. copy-and-pasting the first bit(s) of it
  • Note: |> is now in “base R”, replacing the older %>% which previously needed a package

dplyr: grouping and summarising

  • group_by works in conjunction with another function that performs an operation “by group”
  • In our reports, we often use group_by with tally:
incidents |> group_by(DateCreated) |> tally() |> plot()

A note on workflow

  • In our workflow to date we have:
    • worked with “data dumps” exported from SQL databases as CSV files
    • loaded these into local computer memory
    • performed calculations on local CPU
  • Issues with this include:
    • Data security
    • Data integrity/versioning
    • Storage/compute limitations
  • Much better (if possible) is to connect directly to SQL databases
    • dbplyr enables doing this and using dplyr code as if the data were local
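A sketch of how this looks with dbplyr. The in-memory SQLite database and toy table below are placeholders so the example is self-contained; in practice `con` would point at the real SQL server:

```r
library(DBI)
library(dplyr)

# Placeholder: an in-memory SQLite database instead of a real server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a toy table in so the sketch runs; normally the table
# would already exist in the database
copy_to(con, data.frame(DataZone = c("A", "A", "B")), "incidents")

incidents_db <- tbl(con, "incidents")   # a lazy, remote table

# Ordinary dplyr code: dbplyr translates it to SQL and runs it on
# the database; collect() fetches the result into local memory
result <- incidents_db |>
  group_by(DataZone) |>
  tally() |>
  collect()

print(result)
```

Note that nothing is computed locally until `collect()` is called, which addresses the storage/compute limitations above.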

Some examples from our reports

What do the following do?

(NB: acorn is a table with each row being a property)

acorn_tallied <- acorn |> group_by(DataZone) |> tally(name = 'n_properties')
incidents_tallied <- incidents |> 
  group_by(DataZone, ACORN_CAT, Risk_Level) |>  tally(name = 'n_incidents') 

Answers:

  1. Counts the number of properties in each data zone

  2. Counts how many incidents there were in each (DataZone, ACORN_CAT, Risk_Level) combination

Some things to try

  • Determine how many incidents there were of each risk level
  • Consider all the incidents in data zone S01012315
    • Filter to obtain these.
    • How many were there?
    • Of these, how many were of type “Dwelling Fire”?
  • Of all the incidents with risk level 5, find the number of fire casualties involved.
    • Pipe the output into table() to make a frequency table.
    • Pipe the resulting table into barplot() to show the data as a plot.
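One possible shape for the first exercise, shown on a toy stand-in for the incidents table (the real one was loaded from incidents.rds earlier):

```r
library(dplyr)

# Toy stand-in: the real exercise uses the incidents table
toy_incidents <- tibble::tibble(Risk_Level = c(2, 2, 5, 3))

counts <- toy_incidents |>
  group_by(Risk_Level) |>
  tally()

print(counts)
```

The other exercises follow the same pattern, combining filter with group_by and tally.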